Computer based system for selecting digital media frames

ABSTRACT

A computer based system for selecting digital media frames is capable of predicting the frames that are to be subject to a subsequent action. The subsequent action could be the selection of the predicted frames for inclusion, to create a new set of frames consisting of the selected frames; it could also be the selection of the predicted frames for exclusion, to create a new set of frames that omits the selected frames. Because the system automatically predicts the frames that are, for example, to be included in or excluded from a new clip, the user no longer needs to define start and end frames manually. Instead, the user merely has to accept the predicted frames or refine the predicted selection. This is far quicker and requires less complex user interaction; these are very important advantages for a system designed for ordinary consumers, as opposed to professional audio or video editors. The system hence finds particular application in consumer oriented devices such as laptop computers, mobile PDAs with wireless connectivity, mobile telephones, set-top boxes and hard-disc based personal video recorders (PVRs).

TECHNICAL FIELD

This invention relates to a computer software system for selecting digital media frames. An end-user performs a subsequent action on the selected frames, such as editing (e.g. selecting some frames only for inclusion and discarding others) and trimming (e.g. discarding start or end frames).

BACKGROUND ART

Application software for editing digital video is an extremely sophisticated and powerful tool because it is primarily designed for, and sold to, the video professional. Such an individual requires access to many complex functions and is prepared to invest time and effort in learning to become skilled in their use. Historically, the terminology and conventions of Digital Editing have evolved from a traditional film editing environment where rushes are cut and spliced together to tell a story or follow a script. As digital mixer technology advanced, new techniques were combined with these conventional methods to form the early pioneering software based digital editors.

To the video or film professional, editing is second nature and the complexities of a time-based media go unnoticed since, having already grasped concepts and learned processes, they are able to concentrate on the nuances of different editing packages, of which there are many.

Conventionally these packages, through the use of a Graphical User Interface (GUI), attempt to provide an abstraction of the media in terms of many separate tracks of video and audio. These are represented on the output device in symbolic fashion and provision is made for interacting with these representations using an input device such as a mouse. Typically the purpose is to create a new piece of media as an output file, composed by assembling clips or segments of video and audio along a timeline that represents the temporal ordering of frames. Special effects such as wipes and fades can be incorporated, transparent overlays can be added, colour and contrast can be adjusted. The list of manipulations made possible by such tools is very long indeed. A typical system is described in, for example, Foreman; Kevin J., et al, “Graphical user interface for a video editing system”, U.S. Pat. No. 6,469,711.

It is possible, however, that an individual who is a consumer of media, rather than a producer, may need to perform a simple editing operation on a media file in order to accomplish their primary task; for example to give a multi-media presentation. In this case such tools have their drawbacks. They may be too expensive to justify individually, or to have enough of in order to be available when or where needed. The limited amount of use and the small fraction of the capabilities used in such situations may make them uneconomic. The steep learning curve associated with such tools may mean that an inappropriate amount of effort is expended on something that is not the primary occupation or concern of the tool user. For occasional or infrequent use there will be reluctance on the part of any user repeatedly to switch environments or learn and relearn new tools to perform simple last minute tasks.

Work has been carried out with the view of improving the interaction between a user and a video editor by providing ‘intelligent’ operations. The ‘Silver’ project (Juan P. Casares. “SILVER: An Intelligent Video Editor.” ACM CHI'2001 Student Posters. Seattle, Wash. Mar. 31-Apr. 5, 2001. pp. 425-426) uses ‘smart selection’ to assist the user to find ‘in’ and ‘out’ points. The ‘in’ and ‘out’ points are roughly set by the user and then ‘snap’ to a boundary, which could be a shot change or the silence between spoken words, or other similar features. Video and audio boundaries typically will not line up so the system provides some ‘fixing-up’ functions to smooth the edit boundary.

Conventionally, video editors are application programs that run on high-end PCs and workstations, under desktop-oriented operating systems such as Microsoft Windows or Apple's Mac OSX, often with high-resolution screens and high-bandwidth network connectivity. The viewing of media files, however, can take place on an ever-expanding list of devices with many different capabilities, such as laptops, mobile PDAs with wireless connectivity, mobile phones, set-top boxes and hard-disc based personal video recorders (PVRs). The concept of a simple media manipulation tool integrated into the media player component is as relevant in these cases as it is in that of the standard PC, possibly more so since, for example, a PVR may not have a run-time environment capable of running external applications such as video editors.

Another class of device that is becoming ever more capable of media manipulation is the mobile phone. Such devices now have the ability to capture, display and transmit moving images, but, conventionally, are not thought of as a platform for editing video. There is no reason, however, why simple editing operations should not be applied here in order to enhance even the simplest and shortest of video presentations. Mobile phones present a unique set of challenges to the user interface component of any application. First and foremost the display area is extremely limited and so immediately rules out multi-level menus, timelines and story-boards. Secondly, the user interface is extremely constrained: there is no mouse input, only a few options can be displayed at a time, and all interaction must be performed using a set of navigation buttons (which may vary in position and size according to the hardware manufacturer). Thirdly, the user expects to be able to perform any action one-handed.

Accordingly, these are the attributes of a media frame selection tool that is appropriate to the needs of such a device:

-   Simple and intuitive to use; in particular, little time and effort is required to learn enough to accomplish the task in hand.
-   Efficient use of screen area; no menus, timelines or story-boards.
-   Efficient use of user input interface.
-   Efficient editing model that allows simple trimming operations to be performed simply, whilst permitting more complex tasks to be carried out.

SUMMARY OF THE PRESENT INVENTION

In a first aspect, there is a computer based system for selecting digital media frames, the system being capable of predicting the frames that are to be subject to a subsequent selection action.

The subsequent selection action could be the selection of the predicted frames for inclusion in a new clip; it could also be the selection of the predicted frames for exclusion from a new clip. Once the clip (or an edit list) has been generated, it can be exported.

Because the system automatically predicts the frames that are, for example, to be included in or excluded from a new clip, the user no longer needs to define start and end frames manually; instead, the user merely has to accept the predicted frames or refine the predicted selection. This is far quicker and requires less complex user interaction; these are very important advantages for a system designed for ordinary consumers, as opposed to professional audio or video editors. The system hence finds particular application in consumer oriented devices such as laptop computers, mobile PDAs with wireless connectivity, mobile telephones, set-top boxes and hard-disc based personal video recorders (PVRs). The system can also be integrated with a media player application such that system controls are displayed at the same time as controls for the media player application are displayed. The frames can be video and/or audio frames.

The predictive functionality may work as follows: the device holds in device memory information that defines how a user has previously selected frames for inclusion or exclusion; the device uses that information to predict how the user wishes to select frames for inclusion or exclusion in the future in a way that is consistent with previous behaviour. More specifically, the information can determine the number of frames that the system predicts will be subject to selection. Also, the information held in device memory that is used for frame prediction can be updated whenever the user completes the subsequent selection action.

A graphical user interface may be included: this graphically represents frames and combines those graphically represented frames with a graphical indication of the prediction of which of those graphically represented frames are to be subject to the subsequent selection action.

Typical operation is as follows: the system predicts the frames that are to be subject to the subsequent selection action after the user has selected an initial frame. The initial frame is intended to be one of the following options: the sole frame to be used; the middle of a clip; the start of a clip; the end of a clip. The user can task or navigate through the options by repetitively selecting a button or menu option. Hence, if the user wishes the initial frame to be the middle of a clip, then the system predicts how many frames on either side of the initial frame should be included in the clip, based on previous user interactions. The user can then readily accept these frames for inclusion into the final clip. The user may also operate the system to predict what frames should be excluded in order to create a clip. For example, the user may set the initial frame to be the end of a clip; the system then predicts how many future frames should be excluded. Or the user may set the initial frame to be the start of a clip; the system then predicts how many earlier frames should be excluded. In any event, the prediction can be refined by the user manually extending, or reducing the extent of, the predictively selected frames.

BRIEF DESCRIPTION OF DRAWINGS

The present invention will be described with reference to the accompanying Figures, which illustrate an implementation called VXT.

FIG. 1 shows the allocation of buttons to functions on a typical mobile device running VXT, together with the main graphical user interface elements.

FIG. 2 illustrates that graphical elements that label the buttons can be visible or invisible, according to the context.

FIGS. 3, 4, 5, 6 & 7 show the graphics that are superimposed on video frames to indicate whether they are to be included into, or excluded from, the final edit. The colouring of the included and excluded regions on the edit bar is indicated on the arrows to the left of the device; these mirror the colouring of the superimposed ‘include’ tick and ‘exclude’ cross graphics. In FIG. 4 a single frame (the current one) only is included. In FIG. 5 a region centred on the current frame is included. In FIG. 6 all frames from the start of the clip up to the current frame are included. In FIG. 7 all frames from the current one through to the end of the clip are included.

FIG. 8 shows the major elements of the VXT system, which consist firstly of interactions of the user with the Graphical User Interface, secondly of system tasks carried out by a computer program, and thirdly of variables held in computer memory which have the property of persisting between invocations of the program.

FIG. 9 shows in more detail how the chosen region of video is refined.

FIG. 10 is an example of a C-language program that executes the system tasks.

FIGS. 11, 12, 13, 14, 15, 16 & 17 show the debug output from the program of FIG. 10 for various cases illustrative of how the system may be used. In FIG. 11 the predicted region is accepted. In FIG. 12 the region is grown by using the shuttle forwards or backwards button and then accepted. In FIG. 13 a single frame is chosen. In FIG. 14 two iterations of ‘move’ and ‘grow’ are used to select a large region from the middle part of the video clip. In FIG. 15 a large region is selected by choosing to include all the frames from the start, or end, of the clip. In FIG. 16 the video is trimmed by excluding the start and end regions. In FIG. 17 a selected region is trimmed by excluding a smaller region from the front.

DETAILED DESCRIPTION

The invention is implemented in a system called VXT: VXT enables simple, predictive video message preparation, analogous to the predictive text editing for mobile ‘TXT’ing. VXT does not use the conventional editing semantics of ‘in’ and ‘out’ points; instead, it predictively determines edit limits using rules that are updated through user feedback. It hence minimises the typical number of user interactions required to perform a simple video editing or trimming task.

Briefly, VXT works as follows.

The sequence of actions from the user loading a piece of digital media to the user applying the edits is called a ‘session’; the first operation the user performs during a session is called the ‘initial selection’; subsequent operations that the user performs are called the ‘refinement phase’; a frame or frames that are in the final edit are ‘included’; those that are not are ‘excluded’; an operation that causes a number of frames to change state from ‘excluded’ to ‘included’ or vice-versa is called a ‘grow’ operation; the actual number of frames that change state from ‘excluded’ to ‘included’, or vice-versa, during a grow operation is called the ‘support’.

Means are provided for storing, as variables in a computer memory, information about the history of interactions between the user and the video preparation tool; these are called ‘session variables’ and assist the user to determine the limits of initial selection, e.g. frames that are initially to be included or excluded, by predictively identifying these frames.

In VXT, an integer session variable called p is used to predict the number of frames labelled as ‘included’, as a proportion of the initial length of the clip, when the user makes the initial selection. When the program is used for the first time ever this session variable is set to an arbitrary initial value, for example, 4. If the length of the clip in frames is L then the support is given by s=L/p. For example, if p equals 4 and L equals 100 then the support s equals 25 frames. Therefore, if the user nominates a particular frame as being ‘included’, then the system determines that 25 frames previous, and 25 frames subsequent, to this frame, may also be included. Hence, an edited version of the clip can be rapidly generated.
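The initial prediction can be sketched in C as follows. This is a minimal illustration and not the program of FIG. 10: the names `vxt_region`, `vxt_initial_support` and `vxt_initial_region` are hypothetical, frames are assumed to be numbered from 0 to L−1, and clamping the region to the ends of the clip is an assumption that the text does not spell out.

```c
#include <assert.h>

/* Inclusive range of frame indices (hypothetical helper type). */
typedef struct {
    int first;
    int last;
} vxt_region;

/* Support s = L / p, using integer division as for an integer p. */
int vxt_initial_support(int clip_length, int p)
{
    assert(clip_length > 0 && p > 0);
    return clip_length / p;
}

/* Tentative region: s frames before and s frames after the nominated
 * frame. Clamping to the clip boundaries is an assumption. */
vxt_region vxt_initial_region(int clip_length, int p, int current)
{
    int s = vxt_initial_support(clip_length, p);
    vxt_region r;

    assert(current >= 0 && current < clip_length);
    r.first = (current - s < 0) ? 0 : current - s;
    r.last = (current + s > clip_length - 1) ? clip_length - 1 : current + s;
    return r;
}
```

With L equal to 100 and p equal to 4, `vxt_initial_support` gives 25, and nominating frame 50 yields the tentative region from frame 25 to frame 75.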

After an editing session is complete, the actual number of frames (f) included in the final video message is read and is used to derive a new value of the session variable used for prediction p as follows: p(new)=2L/f. The factor of two arises because the support is applied on both sides of the nominated frame, so a prediction made with p covers about 2s=2L/p frames. So, for example, if the clip is 100 frames long and the length of the final message is 40 frames then the new value of p reflects the fact that fewer frames were actually required than were predicted, and the predicted p for the next edit session becomes 200/40=5. Assuming an initial length of 100 frames in the next editing session, a support value s equal to 20 frames is used for the next initial selection.
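The update rule p(new)=2L/f can be sketched in the same way. The function name `vxt_update_p` is hypothetical, and keeping the previous value of p when no frames were included is an assumption made here only to avoid a division by zero; the text does not say what happens in that case.

```c
#include <assert.h>

/* New value of the prediction variable after a session:
 * p(new) = 2L / f, with integer division.
 * clip_length     : L, the length of the clip in frames
 * frames_included : f, the frames included in the final edit
 * old_p           : the value of p used during the session */
int vxt_update_p(int clip_length, int frames_included, int old_p)
{
    assert(clip_length > 0 && old_p > 0);
    assert(frames_included >= 0 && frames_included <= clip_length);

    if (frames_included == 0) {
        return old_p; /* assumption: nothing was included, keep p */
    }
    return (2 * clip_length) / frames_included;
}
```

For the example in the text, `vxt_update_p(100, 40, 4)` gives 5, so a 100-frame clip in the next session starts with a support of 100/5 = 20 frames.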

Means are also provided for using and updating the ‘session variables’ to assist the user to determine the limits of editing operations that occur during the refinement phase by predictively identifying frames to be included or excluded. These session variables hence reflect the history of prior user interactions, i.e. how the user has previously chosen to edit frames.

In the preferred embodiment, a vector of integer variables r(i) is used to model how the user refines the initial edit; the value of r(i) is equal to the difference in the value of the support variable s between the (i−1)th and ith refinement edits and is used to predict new values for s during refinement phases.

Any operation that results in a change of state of a frame from ‘excluded’ to ‘included’ is treated as a new edit and causes the index i in r(i) to increment.
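One possible reading of the refinement vector is sketched below. The text defines r(i) as a difference in s between successive refinement edits but does not give the exact prediction rule, so the additive form s(i)=s(i−1)+r(i), the fixed history length, the floor of one frame, and all of the names used here are assumptions.

```c
#include <assert.h>

#define VXT_MAX_REFINEMENTS 8 /* assumed upper bound on the index i */

/* Stored differences r(i), one per refinement edit (hypothetical type). */
typedef struct {
    int r[VXT_MAX_REFINEMENTS];
} vxt_refine_history;

/* Predict the support for refinement edit i from the support used in
 * edit i-1 and the stored difference r(i). The floor of one frame is
 * an assumption. */
int vxt_predict_refined_support(const vxt_refine_history *h, int i, int prev_s)
{
    int s;

    assert(h != 0 && i >= 0 && i < VXT_MAX_REFINEMENTS);
    s = prev_s + h->r[i];
    return (s < 1) ? 1 : s;
}

/* After the user settles edit i, store the observed difference so that
 * it can be used in a later session. */
void vxt_record_refinement(vxt_refine_history *h, int i, int prev_s, int actual_s)
{
    assert(h != 0 && i >= 0 && i < VXT_MAX_REFINEMENTS);
    h->r[i] = actual_s - prev_s;
}
```

Under these assumptions, a user who habitually shrinks the support from 25 to 15 frames at the first refinement would see a later session propose 10 frames after an initial support of 20.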

A user interacts with a program running in computer memory in order to edit a video clip. The program is able to store and retrieve persistent variables to and from computer memory; these assist the editing operation.

Referring to FIG. 8, in the preferred embodiment there are tasks carried out by the user, tasks carried out by the computer program, and variables in memory. The initial selection (800) involves the user choosing a current frame and the system using a stored value (811) to calculate an initial value for s which is used to create a tentative region of frames. The user may press the ‘apply’ button to take this region (805) and the region is exported as a new clip (804). Alternatively, the user continues to manipulate the user interface and the refinement phase (801) is entered. In this phase the user continues to make adjustments (806) that cause the refinement part of the computer program (802) to update the session variables (803) and to adjust the visual feedback to the user (807). This process iterates until the user is satisfied and chooses to export the result as a new clip (810). At this point the system updates the persistent variable p in memory (812).

Referring to FIG. 9, the iterative refinement process operates as follows. The user operates the include and exclude buttons repeatedly (901), as described below, in order to select a region of frames for inclusion. Stored variables (902) are used to determine the sizes of blocks of frames added or subtracted during this process. This cycle is ended when the user moves to a new current frame, at which point the system (905) updates the stored variables pertinent to this iteration. The user decides (907) whether or not to take this region; if so the refinement phase ends (908), otherwise it continues in the same mode of operation until the feedback from the system (909) is such that the user is satisfied with the result (910) and the process terminates.

FIG. 10 is an example of a program written in the C language for carrying out the described functions. The program essentially consists of a loop that inputs the user interactions and updates variables that represent the edit points accordingly.

A Graphical User Interface (GUI) input interface for editing is defined; referring to FIGS. 1 and 2; in the preferred embodiment the controls consist of five buttons:

-   one for video ‘forward’ shuttle;
-   one for video ‘backward’ shuttle;
-   one button meaning ‘include’;
-   one button meaning ‘exclude’;
-   one button meaning ‘apply’.

A Graphical User Interface (GUI) output interface for editing is defined for feedback to the user.

Referring to FIG. 3; in the preferred embodiment the graphical elements consist of:

-   an ‘edit bar’ graphic on the display; this comprises a sequence of coloured rectangular areas;
-   a ‘frame pointer’ that marks the current frame on the edit bar;
-   a ‘frame display’ that shows the current frame and optionally portions of adjacent frames;
-   an ‘include’ graphic which overlays the corresponding frame shown in the frame display and consists of a green ‘tick’;
-   an ‘exclude’ graphic which overlays the corresponding frame shown in the frame display and consists of a red ‘cross’.

Means are provided for the user to select the region of the video message that is of interest.

In the preferred embodiment, the user operates the ‘forward’ and ‘backward’ shuttle buttons to find a representative frame in the part of the clip that is ‘of most interest’. The desired frame is displayed in the frame display along with smaller, under-sampled versions of the previous and following frames.

Means are provided to feed back to the user, without the user having to preview the edit, which frames are ‘included’ and which are ‘excluded’. In the preferred embodiment the ‘edit bar’ represents the video clip being edited and a pointer in the ‘edit bar’ indicates the frame currently being viewed. The edit bar is in effect a zoomed out view of the frame display with no media content in each rectangular area. It gives context to the editing operations. Regions of the bar that are green represent ‘included’ sections; regions that are red represent ‘excluded’ sections. The colour is indicated next to the vertical arrows to the left of the mobile phone. Prior to any editing taking place the bar is completely red, meaning that all the frames are ‘excluded’.

Means are also provided to feed back to the user, when the user previews the edit, which frames are ‘included’ and which are ‘excluded’. Referring to FIGS. 4, 5, 6 & 7; in the preferred embodiment each frame shown in the frame display that is ‘included’ is overlaid with a green ‘tick’ and each frame that is ‘excluded’ is overlaid with a red ‘cross’. The user can review these frames using the forward and backward shuttle controls.

Means are provided for the user to manipulate the region of the video message that is included. The user operates the ‘forward’ and ‘backward’ shuttle buttons, ‘include’ button, and ‘apply’ button in order to grow regions of the video clip for inclusion in the final edit. Assuming that the user has stopped at a frame in a region of interest the interaction is as follows:

-   Referring to FIG. 11: If the ‘apply’ button is pressed the predicted region is exported as a new clip, without further interactions.
-   Referring to FIG. 12: If the ‘forward’ or ‘backward’ shuttle buttons are pressed and released at a given frame, followed by the ‘apply’ button, the included region is extended up to that frame.
-   Referring to FIG. 13: If the ‘include’ button is pressed once the part of the edit bar under the frame pointer goes green to indicate that only the current frame is included; the rest of the bar remains unchanged.
-   Referring to FIG. 14: If the ‘include’ button is pressed once more, a region corresponding to the support before and after the frame pointer position goes green to indicate that this region is included in addition to the currently included frames; the rest of the bar remains unchanged.
-   Referring to FIG. 15: If the ‘include’ button is pressed once more, a region from the start of the bar up to the pointer and a region corresponding to the support after the frame pointer position go green to indicate that all the frames from the beginning of the video to the current position are included, and a number of frames after the current position corresponding to the support are also included.
-   Referring to FIG. 15 again: If the ‘include’ button is pressed once more, a region from the end of the bar back to the pointer and a region corresponding to the support before the frame pointer position go green to indicate that all the frames from the current position to the end of the video are included, and a number of frames before the current position corresponding to the support are also included.
-   Further presses repeatedly cycle round the four above cases.
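The four-way ‘include’ cycle described in the list above can be sketched as a small function that maps the number of presses to the region that goes green. This is an illustration only, not the program of FIG. 10: the names `vxt_region` and `vxt_include_region` are hypothetical, frames are assumed to be numbered from 0 to L−1, clamping to the clip boundaries is an assumption, and the sketch returns only the region added by the press (merging it with frames that are already included is left out).

```c
#include <assert.h>

/* Inclusive range of frame indices (hypothetical helper type). */
typedef struct {
    int first;
    int last;
} vxt_region;

/* Region that goes green after the given number of consecutive presses
 * of 'include' at frame `current`, with support s. */
vxt_region vxt_include_region(int presses, int current, int s, int clip_length)
{
    int last_frame = clip_length - 1;
    vxt_region r;

    assert(presses >= 1 && s >= 0 && clip_length > 0);
    assert(current >= 0 && current <= last_frame);

    switch ((presses - 1) % 4) {
    case 0: /* first press: the current frame only */
        r.first = current;
        r.last = current;
        break;
    case 1: /* second press: the support before and after the pointer */
        r.first = current - s;
        r.last = current + s;
        break;
    case 2: /* third press: start of clip to pointer, plus support after */
        r.first = 0;
        r.last = current + s;
        break;
    default: /* fourth press: support before, pointer to end of clip */
        r.first = current - s;
        r.last = last_frame;
        break;
    }

    /* Assumption: regions are clipped to the extent of the clip. */
    if (r.first < 0) {
        r.first = 0;
    }
    if (r.last > last_frame) {
        r.last = last_frame;
    }
    return r;
}
```

For a 100-frame clip, a support of 25 and the pointer at frame 50, successive presses give the regions 50 to 50, 25 to 75, 0 to 75 and 25 to 99, after which a fifth press returns to the single-frame case.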

The user can also operate two ‘handles’ on the edit bar that define the start and end of the included region, respectively.

The user can also operate the ‘exclude’ button to grow regions of the video clip for exclusion from the final edit. Assuming that the user has stopped at a frame in a region of interest the interaction is as follows:

-   If the ‘exclude’ button is pressed once then all of the edit bar apart from that under the frame pointer goes red to indicate that only the current frame is ‘included’; the rest of the bar remains unchanged. This is equivalent to the first ‘include’ cycle.
-   Referring to FIG. 16: If the ‘exclude’ button is pressed once more, a region corresponding to the support at the start and end of the clip goes red to indicate that these regions are ‘excluded’; the rest of the bar remains unchanged.
-   Referring to FIG. 17: If the ‘exclude’ button is pressed once more, a region of size s at the start of the currently included region goes red to indicate that these frames are ‘excluded’.
-   If the ‘exclude’ button is pressed once more, a region of size s at the end of the currently included region goes red to indicate that these frames are ‘excluded’.

Further presses repeatedly cycle round the four above cases.
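The ‘exclude’ cycle can be sketched in the same style, this time as a function that takes the currently included region and returns what remains included after the press. All names are hypothetical and this is not the program of FIG. 10. The sketch assumes that the included frames form a single contiguous region, that frames are numbered from 0 to L−1, and that a region whose first frame lies beyond its last frame stands for "nothing included"; none of these details is stated in the text.

```c
#include <assert.h>

/* Inclusive range of frame indices (hypothetical helper type). */
typedef struct {
    int first;
    int last;
} vxt_region;

/* Convenience constructor for a region. */
vxt_region vxt_make_region(int first, int last)
{
    vxt_region r;
    r.first = first;
    r.last = last;
    return r;
}

/* Assumption: first > last encodes an empty included region. */
int vxt_region_is_empty(vxt_region r)
{
    return r.first > r.last;
}

/* Included region remaining after the given number of consecutive
 * presses of 'exclude' at frame `current`, with support s. */
vxt_region vxt_exclude_step(int presses, vxt_region inc, int current,
                            int s, int clip_length)
{
    int last_frame = clip_length - 1;

    assert(presses >= 1 && s >= 0 && clip_length > 0);
    assert(current >= 0 && current <= last_frame);

    switch ((presses - 1) % 4) {
    case 0: /* everything except the current frame goes red */
        inc.first = current;
        inc.last = current;
        break;
    case 1: /* s frames at the start and at the end of the clip go red */
        if (inc.first < s) {
            inc.first = s;
        }
        if (inc.last > last_frame - s) {
            inc.last = last_frame - s;
        }
        break;
    case 2: /* s frames at the start of the included region go red */
        inc.first += s;
        break;
    default: /* s frames at the end of the included region go red */
        inc.last -= s;
        break;
    }
    return inc;
}
```

Starting from a fully included 100-frame clip with a support of 10, the second case leaves frames 10 to 89, the third case then leaves 20 to 89, and the fourth leaves 20 to 79.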

Means are provided for the user to export the edited video message. The user operates the ‘apply’ button to export the edited video message.

Means are also provided for the user to select further options prior to completion:

The user can select, through interaction with a menu, the following:

-   add ‘fades’ where frames have been deleted;
-   add ‘transitions’ where frames have been deleted;
-   add a background music track;
-   add text annotation.

If any editing operation results in a single stationary frame being displayed to the user then this frame can be treated as a still image and processed separately.

The system monitors the support for the currently displayed frame and, if this is equal to one, asks the user via a message box whether this frame is required as a still image; if the user replies ‘yes’ then the still is captured and stored, and the editing session can then proceed.

As a simple example of the use of the invention consider this scenario. Using a built-in camera, a user of a mobile phone captures a short segment of video from a birthday party and wishes to trim the segment. This trimming operation is wanted in order both to focus in on the moment when the children blow out the candles on the birthday cake, and to minimise the cost of mailing the video segment to friends and family. The video segment is shuttled until the actual frame when the candles go out is displayed. The ‘include’ button is pressed twice and the preparation tool, based on the past history of user interaction, determines that three seconds of video before and after the chosen frame should be included in the edit. The user runs to the start of the ‘included’ region and, using the ‘include’ button, adds more frames to the final edit. The user then quickly runs forward and backward checking that green ‘tick’ markers appear in the part of the clip of interest; then the ‘apply’ button is pressed and the editing process is completed. The system measures the actual number of frames set as ‘included’ and updates the memory variables used for future prediction.

Extensions

The system described above is capable of predicting the frames that are to be subject to a subsequent selection action based on empirical information defining past user behaviour. It is also possible for predictions to be based on pattern classification applied to the frame content using fuzzy logic or neural nets, or on pre-defined rules applied to meta-data stored with the frames or to other kinds of data that can be extracted from the frames by suitable processing.

1. A computer based system for selecting digital media frames, the system being capable of predicting the frames that are to be subject to a subsequent selection action.
2. The system of claim 1 in which the subsequent selection action is the selection of the predicted frames for inclusion to create a new clip.
3. The system of claim 2 in which the subsequent selection action is the selection of the predicted frames for exclusion from a new clip.
4. The system of claim 3 in which the device holds in device memory information that defines how a user has previously selected frames for inclusion or exclusion; the device using that information to predict how the user wishes to select frames for inclusion or exclusion in the future in a way that is consistent with previous behaviour.
5. The system of claim 4 in which the information held in device memory that is used for frame prediction is updated whenever the user completes the subsequent selection action.
6. The system of claim 4 in which the information determines the number of frames that the system predicts will be subject to selection.
7. The system of claim 1 that graphically represents frames and combines those graphically represented frames with a graphical indication of the prediction of which of those graphically represented frames are to be subject to the subsequent selection action.
8. The system of claim 1 in which the system predicts the frames that are to be subject to the subsequent action after the user has selected an initial frame.
9. The system of claim 8 in which the initial frame is intended to be one of the following options: the sole frame to be used; the middle of a clip; the start of a clip; the end of a clip.
10. The system of claim 9 in which the user can task or navigate through the options by repetitively selecting a button or menu option.
11. The system of claim 1 in which the system enables the user to select further actions to be performed on frames; the further actions being selected from the list: annotations; effects; transitions.
12. The system of claim 1 where the frames are video and/or audio frames.
13. The system of claim 1 that is integrated with a media player application such that system controls are displayed at the same time as controls for the media player application are displayed.
14. The system of claim 1 wherein the device is selected from the following list: laptop computer, mobile PDA with wireless connectivity, mobile telephone, set-top box; hard-disc based personal video recorders (PVR).
15. The system of claim 1 in which the frames, or a list of those frames, that have been subject to the subsequent selection action are exported.
16. The system of claim 1 which is capable of predicting the frames that are to be subject to a subsequent selection action based on pattern classification applied to the frame content using fuzzy logic or neural nets or by applying pre-defined rules to meta-data stored with the frames or other kinds of data that can be extracted from the frames by suitable processing.
17. A method of selecting digital media frames, comprising the step of predicting the frames that are to be subject to a subsequent selection action.